Abstract

This paper proposes a novel approach to action recognition from RGB-D cameras, in which depth features and RGB visual features are jointly used. Rich heterogeneous RGB and depth data are effectively compressed and projected to a learned shared space, in order to reduce noise and capture useful information for recognition. Knowledge from the two sources can then be shared in the learned space to learn cross-modal features, which guides the discovery of valuable information for recognition. To capture complex spatiotemporal structural relationships in visual and depth features, we represent both RGB and depth data in a matrix form. We formulate the recognition task as a low-rank bilinear model composed of row and column parameter matrices. The rank of the model parameter is minimized to build a low-rank classifier, which is beneficial for improving the generalization power. The proposed method is extensively evaluated on two public RGB-D action datasets, and achieves state-of-the-art results. It also shows promising results if RGB or depth data are missing during training or testing.


1.  Introduction 

Action recognition from RGB-D cameras has been receiving increasing interest in the computer vision community due to the recent advance of easy-to-use and low-cost depth sensors such as Kinect sensors [?]. In addition to RGB visual data captured by conventional RGB cameras, RGB-D cameras provide depth data, which encode rich 3D structural information of the entire scene. Previous work [?] showed that effective usage of 3D structural information facilitates recognition tasks, as it simplifies intra-class motion variations and removes cluttered background noise.

Despite their effectiveness, these methods are only applicable when depth data are available. Methods developed in [?] are particularly designed for depth data, and thus would fail if depth data are unavailable or missing in RGB-D cameras. In addition, depth data are noisy due to spatiotemporal discontinuous regions. This hinders the application of feature extraction methods such as surface normals [?] and spatiotemporal interest points [?] in these regions. If the discontinuous regions appear in the body parts that were supposed to provide discriminative cues, such as arms or legs, the recognition performance will be degraded when depth information is the only cue.

Figure 1. Our method projects and compresses both RGB visual features and depth features to a learned shared feature space. Classification boundaries are learned in the shared space for action recognition. This process iterates until convergence.

RGB data and depth data can be complementary to each other if one of them is missing. Implicit correlations between them can be learned to handle the case that one of them is unavailable. Moreover, RGB data are robust with no discontinuities. Numerous feature descriptors (e.g., gradient and optical flow) can be extracted from RGB data, providing abundant and robust features for recognition tasks.

Furthermore, human bodies consist of multiple structural objects, and thus motions of human body parts are highly correlated. Existing work for action recognition from depth sequences [?] attempted to capture spatiotemporal correlation information of body part movements by aggregating features from neighborhoods. However, this information collapses when co-occurrence features are concatenated into high-dimensional vectors [?].

In this paper, we propose a novel bilinear heterogeneous information machine (BHIM) for action recognition from RGB-D sequences. BHIM learns cross-modal features that effectively capture heterogeneous visual and depth information. RGB and depth data are treated as two modalities in this work. We project the original features of the two modalities onto a shared space, and learn cross-modal features shared between them for classification in order to effectively capture cross-modal knowledge. The learned cross-modal features inherit the characteristics of both RGB and depth data that capture motion, 3D structural, and spatiotemporal relationship information. Moreover, the features are "filtered" for noise removal in the projection procedure. We show in the experiments that the learned cross-modal features are expressive and discriminative for differentiating action categories, even if one modality is missing in training or testing.

We represent both visual and depth features in a matrix form, which naturally encodes spatiotemporal structural relationships. Even though feature matrices are projected onto a low-dimensional space, the structural information of body parts is conserved and motion information is compressed and denoised. This overcomes the aforementioned problem of the collapsed information in feature vectors.

The recognition problem is formulated in a low-rank bilinear framework, particularly designed for feature representations in a matrix form. The proposed model learns feature projection matrices and a classification parameter matrix, which operate as feature weighting in both rows and columns, respectively. The projection matrices are optimized to map original heterogeneous visual and depth features onto a shared feature space, which is the optimal space for building robust and effective cross-modal features for recognition. An information measure is incorporated in the learning of the projection matrices to help reduce noise in the feature projection procedure. Classification is performed using the learned cross-modal features. The rank of the model is minimized from the viewpoint of generalization power and computational cost [22].

We propose an efficient algorithm to optimize BHIM. Without approximations or a hard constraint on the rank of the parameter matrices, we present a regularized risk minimization problem that produces low-rank projection matrices and an action classifier by minimizing the Frobenius norm of the parameter matrices. This allows us to use existing efficient SVM solvers. The learning problem is iteratively solved with a bundle method [19, ?] as the solver for the inner optimization problem.

The main contribution of this work is the BHIM, a novel formalism for RGB-D action recognition. With inputs of feature matrices rather than vectors, BHIM keeps inherent spatiotemporal structural information within features, which plays a key role in recognition. In addition, BHIM learns a shared space for heterogeneous data (RGB and depth data in this work), where knowledge can be shared between them. BHIM directly minimizes the rank of parameter matrices, and produces compact yet expressive cross-modal features through the use of an information measure. An efficient solver is designed for BHIM and achieves superior performance over state-of-the-art methods.

2.  Related  Work 

Previous action recognition approaches mainly focus on RGB action videos [?, ?, ?, 6]. These studies used low-level interest point features [?], mid-level semantic features [?] or human pose [?], or learned features using deep learning techniques [6]. However, misclassification remains due to large intra-class variations such as motion and pose.

Due to the advent of low-cost Kinect sensors [?], many attempts have been made at object recognition [?] and action recognition [10, 13, 25, 5] from depth images. One of the main advantages of depth data is that they capture 3D structural information, which helps reduce background noise and simplifies intra-class variations. Effective features have been proposed for recognition from depth data, such as action graph [?], histogram of oriented 4D normals [?], super normal vector [25], 4D interest point-based method [?], and depth spatiotemporal interest points [23]. Features from depth sequences can be encoded by [12], or be used to build actionlets [21] for recognition. Recent work [?] also showed that features of RGB-D data can be learned using popular deep learning techniques.

Those methods only use depth data, and thus would fail if depth data are missing. In contrast, our method uses both RGB and depth data, and can handle the case in which one modality is missing. Moreover, they use features in a vector form, in which spatiotemporal structures easily collapse [18, ?]. In this work, we propose to use features in a matrix form, which naturally captures both spatiotemporal structural information and motion information. We show in the experiments that features in a matrix format significantly improve the performance even when the rank of the parameter matrices in BHIM is constrained to be 1.

Feature learning methods [?] have been proposed to learn better feature representations for recognition. Different from [?], we use features from two modalities for recognition. In contrast to [8], we use the Frobenius norm instead of the trace norm, which allows us to use existing efficient SVM solvers. In addition, we use an effective information measure to produce more compact cross-modal features, which was not considered in [14, 8]. Method [24] extends information bottleneck [20] to a multi-view model. In contrast to their work, we learn a low-rank bilinear model, which shows better generalization power than a linear model. In addition, our method can recognize actions if one modality is missing, whereas those methods were not designed to handle a missing modality and their performance in that setting is unclear.

Figure 2. Feature matrix of size $n_{xyt} \times n_f$ is constructed from features (e.g., HOG) computed on all the frames. $n_{xyt}$ is the total number of pixels in all the feature frames, and $n_f$ is the dimensionality of each local feature.

3.  Bilinear Heterogeneous Information Machine

The goal of this work is to utilize heterogeneous features from RGB-D action videos, and learn shared cross-modal features for action recognition. Denote $N$ RGB-D action videos for training purposes by $\{X_i, y_i\}_{i=1}^{N}$, where $X_i = \{X_i^{[v]}, X_i^{[z]}\} \in \mathcal{X}$ contains an RGB visual feature matrix $X_i^{[v]} \in \mathcal{X}_v$ and a depth feature matrix $X_i^{[z]} \in \mathcal{X}_z$ extracted from RGB-D data, and $y_i \in \mathcal{Y}$ is the corresponding action label. Note that $X_i^{[v]}$ and $X_i^{[z]}$ in our work are defined as feature matrices of size $n_{xyt} \times n_f$, different from feature vectors containing $n_{xyt} \times n_f$ elements that are popularly used in the computer vision community. In this work, features $X_i^{[v]}$ and $X_i^{[z]}$ (such as histogram of oriented gradients) are extracted from a spatiotemporal grid of $n_{xyt} = n_x \times n_y \times n_t$ locations, and $n_f$ is the dimensionality of each local feature. Action representation in a matrix form allows us to capture the inherent structure of features, such as spatiotemporal relationships. In contrast, these relationships are collapsed in a vector form feature representation. Note that one can pull out other dimensions rather than the feature dimension in $X_i^{[v]}$ and $X_i^{[z]}$, but the structure of $n_{xyt}$ pixels in the feature matrices will not be conserved by the proposed model.
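As a minimal sketch of the distinction between the two representations, the following NumPy snippet stacks a grid of local descriptors into an $n_{xyt} \times n_f$ feature matrix and contrasts it with the flattened vector. The grid sizes and the helper name `to_feature_matrix` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def to_feature_matrix(grid):
    """Stack local descriptors from an (n_x, n_y, n_t, n_f) grid into an
    (n_xyt, n_f) matrix with one row per spatiotemporal cell."""
    n_x, n_y, n_t, n_f = grid.shape
    return grid.reshape(n_x * n_y * n_t, n_f)

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
grid = rng.random((3, 4, 2, 5))   # n_x=3, n_y=4, n_t=2, n_f=5

X = to_feature_matrix(grid)       # matrix form: rows index grid locations
x_vec = grid.reshape(-1)          # vector form: locations and feature bins share one axis

print(X.shape, x_vec.shape)
```

Both arrays hold the same numbers; only the matrix form keeps the location axis separate from the feature axis, which is what a bilinear model can weight independently.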

RGB-D action data $X_i$ contain two modalities, visual features $X_i^{[v]}$ and depth features $X_i^{[z]}$. The major challenge for effectively using the two-modality features is that they come from different distributions, and thus their similarities cannot be measured directly. To solve this problem, we would like to learn two projection functions $P_v$ and $P_z$ for visual features $X_i^{[v]}$ and depth features $X_i^{[z]}$, respectively. Each of the projection functions maps the corresponding features to a space $\mathcal{O}$ shared between the two modalities: $P_v : \mathcal{X}_v \to \mathcal{O}$, and $P_z : \mathcal{X}_z \to \mathcal{O}$. After learning the projection functions, a classification model $G$ can be learned to classify actions given features in the shared space: $G : \mathcal{O} \to \mathcal{Y}$.

Instead of learning the projection functions, $P_v$ and $P_z$, and the classification function $G$ independently, we are interested in learning these functions simultaneously. Therefore, the learned projections are optimized for classification. We focus on learning a discriminant function $F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ that scores each training sample $(X_i, y_i)$. The function $F$ is applied to compute the compatibility between original RGB-D features $X_i$ and the learned cross-modal features $O_i$, and between the features $O_i$ and an action label $y_i$.

3.1.  Model  Formulation 

Suppose we are given $M$ ($M = 2$ in this work) types of modalities $\{X_i^{[m]}\}_{m=1}^{M}$. Here, $m$ is the index of the modality, which can be either visual ($m = 1$) or depth ($m = 2$). We represent the features of both modalities in a matrix form in order to keep the inherent spatiotemporal structure. In this paper, we are interested in a binary bilinear discriminant function $F(X_i, y; W) = \mathrm{Tr}(W^T X_i) = \sum_{m=1}^{M} \mathrm{Tr}(W^{[m]T} X_i^{[m]})$, which is a family of bilinear functions parameterized by a model weight matrix $W$. The one-vs-one scheme is adopted to extend our binary classifier to a multi-class classifier. One of the challenges in RGB-D action recognition is that the two modalities, RGB features and depth features, are in different feature spaces, and thus their similarities cannot be directly computed. We solve this problem by decomposing the parameter matrix $W^{[m]}$ for each modality into two components, $W_w$ and $W_f^{[m]}$: $W^{[m]} = W_w W_f^{[m]T}$ (see Figure 3). Parameter matrix $W_f^{[m]} \in \mathbb{R}^{n_f \times d}$ ($m = 1, \cdots, M$) projects the $m$-th modality data, $X^{[m]}$, onto a learned shared space, and parameter matrix $W_w \in \mathbb{R}^{n_{xyt} \times d}$ is applied to classify the projected data regardless of modalities. $W_w$ is a spatiotemporal template defined over $d$ features at each spatiotemporal location. Consequently, the rank of the model parameter matrix $W^{[m]}$ is enforced to be at most $d$.

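Once the optimal model parameter matrix $W$ is learned from training data, the action label $y^*$ can be computed by

$$y^* = \mathrm{sign}\big(\mathrm{Tr}(W^T X)\big) = \mathrm{sign}\Big(\sum_{m} \mathrm{Tr}\big(W_f^{[m]} W_w^T X^{[m]}\big)\Big), \tag{1}$$

where $\mathrm{sign}(\cdot)$ is the sign function.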
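The decision rule in Eq. (1) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the function names, the dictionary layout keyed by modality (`"v"`, `"z"`), the toy sizes, and the convention that a score of exactly zero maps to +1 are all choices made here, not details given in the paper.

```python
import numpy as np

def bilinear_score(X, W_w, W_f_m):
    """Tr(W_f W_w^T X) for one modality, computed as Tr(W_w^T (X W_f))."""
    projected = X @ W_f_m                    # (n_xyt, d) features in the shared space
    return float(np.sum(W_w * projected))    # elementwise product sums to the trace

def predict(sample, W_w, W_f):
    """Binary label from Eq. (1): sign of the score summed over modalities.
    Ties at zero are mapped to +1 (an arbitrary convention for this sketch)."""
    total = sum(bilinear_score(sample[m], W_w, W_f[m]) for m in W_f)
    return 1 if total >= 0 else -1

rng = np.random.default_rng(0)
n_xyt, n_f, d = 12, 7, 3                     # toy sizes
W_w = rng.standard_normal((n_xyt, d))
W_f = {m: rng.standard_normal((n_f, d)) for m in ("v", "z")}
sample = {m: rng.standard_normal((n_xyt, n_f)) for m in ("v", "z")}

# The factorized score equals Tr(W^T X) with W = W_w W_f^T, and rank(W) <= d.
W_v = W_w @ W_f["v"].T
full_score = float(np.trace(W_v.T @ sample["v"]))
label = predict(sample, W_w, W_f)
```

Computing the score through the projected features avoids forming the full $n_{xyt} \times n_f$ weight matrix, which is the practical benefit of the factorization.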

We train the bilinear model in Eq. (1) in a max-margin framework. Based on the empirical risk minimization principle, we formulate our learning problem as

$$\min_{W_w, W_f^{[v]}, W_f^{[z]}} \; \phi(W_f^{[v]}, W_f^{[z]}) + \lambda \cdot r(W_w, W_f^{[v]}, W_f^{[z]}) + C \cdot l(W_w, W_f^{[v]}, W_f^{[z]}), \tag{2}$$

where $\phi(\cdot)$ is a regularizer term for reducing noise in the projected data, $r(\cdot)$ is an additional regularizer term related to the margin of the bilinear model, and $l(\cdot)$ computes the training loss for the two-modality data. $\lambda$ and $C$ are trade-off parameters balancing the importance of the corresponding terms.

Figure 3. Graphical illustration of the proposed BHIM model. Parameter matrix $W_f^{[m]}$ ($m = 1, \cdots, M$) projects the $m$-th modality data, $X^{[m]}$, into a learned shared space, and $W_w$ is applied to classify the projected data regardless of modalities.

Regularizer $\phi(W_f^{[v]}, W_f^{[z]})$ is a function that attempts to summarize and compress the original two-modality data. Since the raw RGB and depth data may not be in the same space, we use this term to compress the data and discover shared knowledge between the two modalities. We define this term as

$$\phi(W_f^{[v]}, W_f^{[z]}) = I(X^{[v]}, O) + I(X^{[z]}, O), \tag{3}$$

where $X^{[m]} = \{X_i^{[m]}\}_{i=1}^{N}$ ($m = v$ or $m = z$) represents the set of all training samples in the $m$-th modality, $O \in \mathcal{O}$, computed from the sum $X^{[v]} W_f^{[v]} + X^{[z]} W_f^{[z]}$ of the projected features, denotes the learned low-dimensional cross-modal features in the shared space, and $I(\cdot, \cdot)$ computes mutual information.

Cross-modal knowledge can be introduced to the model through the learning of the intermediate features $O$. Cross-modal features $O$ inherit information from both RGB and depth data, including motion, 3D structural, and spatiotemporal relationship information. We show in the experiments that the learned features play an important role in the recognition of RGB-D actions, including the case in which one modality is missing in the training or testing phase.

In addition, the term $\phi(W_f^{[v]}, W_f^{[z]})$ helps to reduce noise and produces a compact representation for cross-modal features $O$. In the learning of cross-modal features $O$, a large amount of noise irrelevant to action labels would also be introduced to the shared space, and thus degrade the recognition performance. By minimizing $\phi(W_f^{[v]}, W_f^{[z]})$, both noisy and discriminative information in $O$ will be reduced, but discriminative information can be well captured by regularizer $r(W_w, W_f^{[v]}, W_f^{[z]})$ in Eq. (4). Parameter $\lambda$ for regularizer $r(W_w, W_f^{[v]}, W_f^{[z]})$ is used for balancing the importance of the noise filter in BHIM.

Regularizer $r(W_w, W_f^{[v]}, W_f^{[z]})$ is used to measure the margin of the bilinear classifier. Minimizing $r(W_w, W_f^{[v]}, W_f^{[z]})$ is equivalent to maximizing the margin of the bilinear model, thereby capturing discriminative information. We define this term as

$$r(W_w, W_f^{[v]}, W_f^{[z]}) = \frac{1}{2}\mathrm{Tr}\big(W_w W_f^{[v]T} W_f^{[v]} W_w^T\big) + \frac{1}{2}\mathrm{Tr}\big(W_w W_f^{[z]T} W_f^{[z]} W_w^T\big). \tag{4}$$

Regularizer term $r(W_w, W_f^{[v]}, W_f^{[z]})$ naturally induces a low-rank classifier with the maximum rank of $d$. This restricts the degree of freedom of the model parameter matrices. As shown in [22], the VC-dimension of low-rank classification models is proved to be less than that of the concatenated linear models.
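Each trace term in the margin regularizer equals the squared Frobenius norm of the factorized weight matrix $W_w W_f^{[m]T}$, which is what connects it to the usual SVM margin term. The NumPy sketch below checks that identity numerically on random matrices; the function name `margin_regularizer`, the modality keys, and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def margin_regularizer(W_w, W_f):
    """Sum over modalities of 0.5 * Tr(W_w W_f^T W_f W_w^T).
    Uses Tr(W_w G W_w^T) = sum(G * H) with G = W_f^T W_f and H = W_w^T W_w,
    so only d-by-d matrices are formed."""
    H = W_w.T @ W_w
    return sum(0.5 * float(np.sum((F.T @ F) * H)) for F in W_f.values())

rng = np.random.default_rng(0)
n_xyt, n_f, d = 10, 6, 2                     # toy sizes
W_w = rng.standard_normal((n_xyt, d))
W_f = {m: rng.standard_normal((n_f, d)) for m in ("v", "z")}

r_value = margin_regularizer(W_w, W_f)
# Same quantity written as squared Frobenius norms of the rank-<=d weight matrices.
frob = sum(0.5 * np.linalg.norm(W_w @ F.T, "fro") ** 2 for F in W_f.values())
```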

Regularizer $r(W_w, W_f^{[v]}, W_f^{[z]})$ is minimized to extract discriminative information from cross-modal features $O$ for action recognition. It works together with $\phi(W_f^{[v]}, W_f^{[z]})$ in Eq. (3) to extract discriminative information and filter out noise for recognition.

Loss function $l(W_w, W_f^{[v]}, W_f^{[z]})$ computes the training loss given the learned model parameter matrices. We consider a binary classifier in this work, and define a hinge loss function for each modality, which is similar to the one in the binary SVM:

$$l(W_w, W_f^{[v]}, W_f^{[z]}) = \sum_{i} \Big[ \max\big(0, 1 - y_i \mathrm{Tr}(W_f^{[v]} W_w^T X_i^{[v]})\big) + \max\big(0, 1 - y_i \mathrm{Tr}(W_f^{[z]} W_w^T X_i^{[z]})\big) \Big]. \tag{5}$$
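A small hand-checkable instance of this per-modality hinge loss is sketched below in NumPy. The helper names, the dictionary layout keyed by modality, and the toy matrices are assumptions chosen so that the scores can be verified by hand.

```python
import numpy as np

def score(X, W_w, W_f_m):
    """Tr(W_f W_w^T X) for one modality."""
    return float(np.sum(W_w * (X @ W_f_m)))

def training_loss(samples, labels, W_w, W_f):
    """Hinge loss summed over samples and over both modalities."""
    total = 0.0
    for sample, y in zip(samples, labels):
        for m, W_f_m in W_f.items():
            total += max(0.0, 1.0 - y * score(sample[m], W_w, W_f_m))
    return total

# Toy case with n_xyt = n_f = 2 and d = 1.
W_w = np.array([[1.0], [0.0]])
W_f = {"v": np.array([[1.0], [0.0]]), "z": np.array([[0.0], [1.0]])}
X_v = np.array([[2.0, 0.0], [0.0, 0.0]])     # visual score = 2
X_z = np.array([[0.0, 0.5], [0.0, 0.0]])     # depth score = 0.5
samples = [{"v": X_v, "z": X_z}]

loss_pos = training_loss(samples, [+1], W_w, W_f)   # 0 + 0.5
loss_neg = training_loss(samples, [-1], W_w, W_f)   # 3 + 1.5
```

With label +1 only the depth term violates the margin; with label -1 both terms do, so the loss grows with each modality's score.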

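Plugging Eq. (3), Eq. (4), and Eq. (5) into Eq. (2), optimal parameter matrices $W_f^{[m]}$ and $W_w$ can be learned by the following constrained optimization problem:

$$\min_{W_w, W_f^{[v]}, W_f^{[z]}} \; \sum_{m} \Big[ I(X^{[m]}, O) + \frac{\lambda}{2} \mathrm{Tr}\big(W_w W_f^{[m]T} W_f^{[m]} W_w^T\big) + C \sum_{i} \xi_i^{[m]} \Big]$$
$$\text{s.t.} \quad y_i \mathrm{Tr}\big(W_f^{[m]} W_w^T X_i^{[m]}\big) \ge 1 - \xi_i^{[m]}, \quad \forall i, \forall m,$$
$$\xi_i^{[m]} \ge 0, \quad \forall i, \forall m, \tag{6}$$

where $\xi_i^{[m]}$ is a slack variable for the $m$-th modality in the $i$-th RGB-D video.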


3.2.  Model  Learning 

The above constrained optimization problem can be solved by a coordinate descent algorithm that solves for one set of parameter matrices at each step with the others fixed. Each step in the algorithm is a regularized risk minimization problem, which can be solved using a bundle method¹ [?]. The bundle method is adopted as the inner problem solver due to its efficiency and good convergence.

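We first reformulate the optimization problem (6) as an unconstrained regularized risk minimization problem:

$$\min_{W_w, W_f^{[v]}, W_f^{[z]}} \; C \cdot \sum_{i} \sum_{m} L_i^{[m]} + \sum_{m} R^{[m]}, \tag{7}$$

where

$$L_i^{[m]} = \max\big(0, 1 - y_i \mathrm{Tr}(W_f^{[m]} W_w^T X_i^{[m]})\big), \qquad R^{[m]} = I(X^{[m]}, O) + \frac{\lambda}{2} \mathrm{Tr}\big(W_w W_f^{[m]T} W_f^{[m]} W_w^T\big), \tag{8}$$

are the empirical loss and the regularizer, respectively.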

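We solve the above problem by a coordinate descent algorithm. Specifically, if $W_f^{[m]}$ is fixed, the optimization problem is

$$\min_{W_w} \; \frac{\lambda}{2} \sum_{m} \mathrm{Tr}\big(W_w W_f^{[m]T} W_f^{[m]} W_w^T\big) + C \sum_{i} \sum_{m} \max\big(0, 1 - y_i \mathrm{Tr}(W_f^{[m]} W_w^T X_i^{[m]})\big). \tag{9}$$

To efficiently solve this problem, we define $A = \sum_{m} W_f^{[m]T} W_f^{[m]}$, and define two auxiliary variables $\tilde{W}_w = W_w A^{\frac{1}{2}}$ and $\tilde{X}_i^{[m]} = X_i^{[m]} W_f^{[m]} A^{-\frac{1}{2}}$. Note that $A$ is a matrix of size $d \times d$ that is in general invertible for small $d$. Then the problem (9) can be equivalently rewritten as

$$\min_{\tilde{W}_w} \; \frac{\lambda}{2} \mathrm{Tr}\big(\tilde{W}_w^T \tilde{W}_w\big) + C \sum_{i} \sum_{m} \max\big(0, 1 - y_i \mathrm{Tr}(\tilde{W}_w^T \tilde{X}_i^{[m]})\big). \tag{10}$$

This is an unconstrained regularized risk minimization problem equivalent to linear SVM if $\tilde{W}_w$ and $\tilde{X}_i^{[m]}$ are vectorized. We solve this problem using a bundle method. After learning $\tilde{W}_w$, the original parameter matrix $W_w$ can be reconstructed by $W_w = \tilde{W}_w A^{-\frac{1}{2}}$.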
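The change of variables that turns the $W_w$ subproblem into a linear-SVM form can be checked numerically. The NumPy sketch below assumes that $A$ is the sum of $W_f^{[m]T} W_f^{[m]}$ over the modalities, which is the choice that lets a single auxiliary matrix serve both modalities; that reading, the helper `sym_power`, and the toy sizes are assumptions made for this illustration.

```python
import numpy as np

def sym_power(S, p):
    """S**p for a symmetric positive-definite matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(S)
    return (eigvecs * eigvals ** p) @ eigvecs.T

rng = np.random.default_rng(0)
n_xyt, n_f, d = 9, 6, 3                      # toy sizes
W_w = rng.standard_normal((n_xyt, d))
W_f = {m: rng.standard_normal((n_f, d)) for m in ("v", "z")}
X = {m: rng.standard_normal((n_xyt, n_f)) for m in ("v", "z")}

# Assumption: A sums W_f^T W_f over modalities (d-by-d, positive definite here).
A = sum(F.T @ F for F in W_f.values())
A_half, A_inv_half = sym_power(A, 0.5), sym_power(A, -0.5)

W_w_tilde = W_w @ A_half
X_tilde = {m: X[m] @ W_f[m] @ A_inv_half for m in X}

# Regularizer and scores before and after the substitution.
reg_original = sum(np.trace(W_w @ F.T @ F @ W_w.T) for F in W_f.values())
reg_tilde = np.trace(W_w_tilde.T @ W_w_tilde)
score_original = [np.trace(W_f[m] @ W_w.T @ X[m]) for m in ("v", "z")]
score_tilde = [np.trace(W_w_tilde.T @ X_tilde[m]) for m in ("v", "z")]
W_w_recovered = W_w_tilde @ A_inv_half
```

Because both the regularizer and every score are unchanged, the subproblem in the auxiliary variables is an ordinary linear SVM over the vectorized matrices.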

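When $W_w$ is fixed, $W_f^{[m]}$ for each modality can be optimized in a similar form to Eq. (9) and Eq. (10), but with $W_w$ as a constant. We define $B = W_w^T W_w$, and further define two auxiliary variables, $\tilde{W}_f^{[m]}$ and $\tilde{X}_i^{[m]}$, as $\tilde{W}_f^{[m]} = W_f^{[m]} B^{\frac{1}{2}}$ and $\tilde{X}_i^{[m]} = X_i^{[m]T} W_w B^{-\frac{1}{2}}$. Then, the parameter matrix $\tilde{W}_f^{[m]}$ for each modality can be optimized independently by

$$\min_{\tilde{W}_f^{[m]}} \; \frac{\lambda}{2} \mathrm{Tr}\big(\tilde{W}_f^{[m]T} \tilde{W}_f^{[m]}\big) + I(X^{[m]}, O) + C \sum_{i} \max\big(0, 1 - y_i \mathrm{Tr}(\tilde{W}_f^{[m]T} \tilde{X}_i^{[m]})\big), \tag{11}$$

with the assumption that the conditional distribution $p(W_w, B^{-\frac{1}{2}} \mid X^{[m]}, O)$ is a uniform distribution². This is also an unconstrained regularized risk minimization problem for linear SVM and can be solved by a bundle algorithm if $\tilde{W}_f^{[m]}$ and $\tilde{X}_i^{[m]}$ are unfolded into vectors. We repeat this step twice, once with visual features $X_i^{[v]}$ and once with depth features $X_i^{[z]}$. After optimizing $\tilde{W}_f^{[m]}$, $W_f^{[m]}$ can be recovered by $W_f^{[m]} = \tilde{W}_f^{[m]} B^{-\frac{1}{2}}$.

Algorithm 1 Bilinear IB model learning algorithm

1: Input: training data $\{X_i^{[m]}, y_i\}$ ($m = 1, \cdots, M$).
2: Output: $W_w$, $W_f^{[m]}$.
3: Initialize variables $W_w$, $W_f^{[m]}$.
4: repeat
5:   Compute $A = \sum_{m} W_f^{[m]T} W_f^{[m]}$, $\tilde{W}_w = W_w A^{\frac{1}{2}}$, and $\tilde{X}_i^{[m]} = X_i^{[m]} W_f^{[m]} A^{-\frac{1}{2}}$.
6:   Fix $W_f^{[m]}$, and optimize $\tilde{W}_w$ by (10).
7:   Recover $W_w = \tilde{W}_w A^{-\frac{1}{2}}$.
8:   Compute $B = W_w^T W_w$, $\tilde{W}_f^{[m]} = W_f^{[m]} B^{\frac{1}{2}}$, and $\tilde{X}_i^{[m]} = X_i^{[m]T} W_w B^{-\frac{1}{2}}$.
9:   Fix $W_w$, and optimize $\tilde{W}_f^{[v]}$ and $\tilde{W}_f^{[z]}$ independently by (11).
10:  Recover $W_f^{[v]} = \tilde{W}_f^{[v]} B^{-\frac{1}{2}}$ and $W_f^{[z]} = \tilde{W}_f^{[z]} B^{-\frac{1}{2}}$.
11: until Objective changes < threshold.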

The proposed BHIM is solved by iteratively optimizing problems (10) and (11) until convergence. This is a biconvex problem, as optimizing one parameter matrix while holding the others fixed is a convex problem. The learning algorithm is shown in Algorithm 1.
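The alternating structure of the learning procedure can be illustrated with a deliberately simplified NumPy sketch. It is not the authors' solver: plain subgradient steps stand in for the bundle method, the mutual-information term is omitted entirely, and the best iterate seen is kept because subgradient steps need not decrease the objective. All function names, hyperparameter values, and the toy data are assumptions.

```python
import numpy as np

def score(X, W_w, F):
    """Tr(F W_w^T X) for one modality."""
    return float(np.sum(W_w * (X @ F)))

def objective(data, labels, W_w, W_f, lam, C):
    """Frobenius regularizer plus hinge loss (mutual-information term omitted)."""
    H = W_w.T @ W_w
    reg = sum(0.5 * lam * float(np.sum((F.T @ F) * H)) for F in W_f.values())
    loss = sum(max(0.0, 1.0 - y * score(x[m], W_w, W_f[m]))
               for x, y in zip(data, labels) for m in W_f)
    return reg + C * loss

def fit(data, labels, d, lam=0.1, C=1.0, step=1e-2, outer=20, inner=10, seed=0):
    rng = np.random.default_rng(seed)
    n_xyt = data[0]["v"].shape[0]
    W_w = 0.1 * rng.standard_normal((n_xyt, d))
    W_f = {m: 0.1 * rng.standard_normal((data[0][m].shape[1], d)) for m in ("v", "z")}
    start = objective(data, labels, W_w, W_f, lam, C)
    best = (start, W_w.copy(), {m: F.copy() for m, F in W_f.items()})
    for _ in range(outer):
        # Step 1: hold every W_f fixed and take subgradient steps on W_w.
        G = sum(F.T @ F for F in W_f.values())
        for _ in range(inner):
            grad = lam * W_w @ G
            for x, y in zip(data, labels):
                for m, F in W_f.items():
                    if y * score(x[m], W_w, F) < 1.0:
                        grad -= C * y * (x[m] @ F)
            W_w = W_w - step * grad
        # Step 2: hold W_w fixed and update each modality's W_f independently.
        B = W_w.T @ W_w
        for m in W_f:
            for _ in range(inner):
                grad = lam * W_f[m] @ B
                for x, y in zip(data, labels):
                    if y * score(x[m], W_w, W_f[m]) < 1.0:
                        grad -= C * y * (x[m].T @ W_w)
                W_f[m] = W_f[m] - step * grad
        value = objective(data, labels, W_w, W_f, lam, C)
        if value < best[0]:
            best = (value, W_w.copy(), {m: F.copy() for m, F in W_f.items()})
    return start, best

# Toy two-class data: each sample is a signed template plus noise in both modalities.
rng = np.random.default_rng(1)
labels = [1, -1] * 5
template = rng.standard_normal((6, 4))
data = [{m: y * template + 0.1 * rng.standard_normal((6, 4)) for m in ("v", "z")}
        for y in labels]
start_value, (best_value, W_w, W_f) = fit(data, labels, d=2)
```

Each inner problem is convex in the block being updated, which is the biconvexity property noted above; a production solver would replace the subgradient loops with the bundle method on the auxiliary variables.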

3.3.  Discussion 

We  highlight  key  properties  of  the  proposed  BHIM  here. 

Matrix form feature representation. Visual and depth features are represented in a matrix form in BHIM, which naturally considers spatiotemporal motion relationships of body parts. In contrast, these relationships are collapsed in the vector form representation used by existing methods [?].

Low-rank bilinear model. BHIM naturally models feature matrices using two model parameter matrices $W_f^{[m]}$ and $W_w$. The rank of the proposed model is minimized to provide better generalization power [?].

Information measure. This is computed in the process of data projection in order to compress data and reduce noise in the learned space. We validate its effectiveness in the experiments.

Cross-modal features. Our BHIM learns cross-modal features from RGB and depth data. The cross-modal features are discriminative for classification as they capture implicit correlations between RGB and depth data, and inherit their characteristics, including motion, 3D structural, and spatiotemporal correlation information.

Knowledge transfer. The learned projection matrix $W_f^{[v]}$ or $W_f^{[z]}$ transfers information from the original data to the learned shared features $O$. This helps exploit cross-modal knowledge if one modality is missing in testing.

1 https://forge.lip6.fr/projects/nrbm

2 Please refer to supplemental material for details.

Table 1. Comparison results with various dimensionality $d$ of the feature space. The dimensionality of features for each modality in linear SVM is $n_{xyt} \cdot d$.

                 d = 1                       d = 5                       d = 31
Methods          Depth    RGB      RGB-D     Depth    RGB      RGB-D     Depth    RGB      RGB-D
linear SVM       47.22%   42.78%   51.67%    72.78%   70.00%   75.00%    86.11%   87.22%   87.78%
bilinear SVM     53.89%   50.00%   70.56%    90.00%   87.22%   91.11%    92.78%   80.00%   96.11%
Our method       83.33%   91.11%   96.11%    88.33%   76.11%   98.33%    93.89%   97.22%   100%

4.  Experiments 

4.1.  Datasets  and  Settings 

The proposed method is evaluated on the MSR Action Pairs dataset [?] and the MSR Daily Activity dataset [?]. The MSR Action Pairs dataset is an indoor RGB-D action dataset containing 12 types of activities performed by 10 subjects with both RGB and depth videos. Each actor repeats an action three times, providing a total of 360 videos for each of the RGB and depth modalities. The MSR Daily Activity dataset contains 16 types of activities performed by 10 subjects. Each actor repeats an action twice, providing a total of 320 videos for each of the RGB and depth channels.

4.2.  MSR  Action  Pairs  Dataset 

Videos in this dataset are temporally normalized to 10 frames with a spatial resolution of 120 × 160. Histogram of oriented gradients (HOG) features are extracted from both depth and RGB videos with a patch size of 8 × 8. Thus, a total of $n_{xyt} = 3000$ patches are extracted from each video, with a feature dimensionality of $n_f = 31$. We follow [13] and use RGB-D videos of the first 5 subjects as training data.
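The patch count quoted above follows from the frame size, the patch size, and the number of frames. The short Python check below reproduces it, assuming the 8 × 8 patches tile each frame without overlap (an assumption that is consistent with the stated total); the function name is illustrative.

```python
def num_grid_cells(height, width, patch, frames):
    """Number of non-overlapping patch cells over all frames."""
    return (height // patch) * (width // patch) * frames

# 120 x 160 frames, 8 x 8 patches, 10 frames: 15 * 20 * 10 cells.
n_xyt = num_grid_cells(height=120, width=160, patch=8, frames=10)
n_f = 31                                  # dimensionality of each local feature
n_elements = n_xyt * n_f                  # entries in one feature matrix

print(n_xyt, n_elements)
```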

Comparison experiment. We compare with existing methods [?], and use linear SVM as a baseline. We also extend the bilinear SVM [?] to handle two-modality data, and use it as a baseline.

Results in Table 2 show that our method outperforms all the comparison approaches. We achieve 100% accuracy as we effectively use both visual and depth features. Compared with linear SVM, which simply concatenates the two features into a long vector, our method finds the optimal space for fusing the two features, and thus improves the performance. Although bilinear SVM also learns a shared feature space for the two features, our method additionally uses the information measure $\phi(W_f^{[v]}, W_f^{[z]})$ in Eq. (3) to compress data and reduce noise irrelevant to our recognition task. Our method also outperforms [?], which shows the benefits of effectively utilizing both visual and depth data, and of representing features in a matrix form. Using a matrix form feature representation allows us to construct a low-rank bilinear model that can improve the generalization power. The learned features and parameter matrices are visualized in Figure 4.

Table 2. Recognition accuracy of comparison methods on MSR Action Pairs dataset.

Methods                        Accuracy
linear SVM                     87.78%
Bilinear SVM [14]              96.11%
Deep Motion Maps [26]          66.11%
Skeleton+LOP+Pyramid [21]      82.22%
LTTL [?]                       91.48%
HON4D [?]                      96.67%
SNV [25]                       98.89%
Our method                     100%

Sensitivity to parameters. The proposed BHIM has three parameters to set: the maximum rank of the bilinear model $d$, the parameter $C$, and the parameter $\lambda$ in Eq. (2). In this experiment, we investigate the sensitivity of BHIM to these parameters.

We first test the sensitivity of BHIM to the maximum rank $d$. BHIM is compared with linear SVM and bilinear SVM with various $d$ values. Note that there are a total of $n_{xyt} \times d$ elements in the shared space for each modality in BHIM and bilinear SVM. To conduct a fair comparison, for linear SVM, we use PCA to reduce the dimensionality of feature vectors of each modality to $n_{xyt} \cdot d$, making sure all three methods have the same number of elements in the low-dimensional features. The projected visual and depth features are concatenated into a long vector and fed to linear SVM. In bilinear SVM and BHIM, the original feature matrix $X^{[m]}$ is projected by $W_f^{[m]}$. The rank parameter $d$ is set to 1, 5, and 31, respectively.

The performance of the three methods on depth features, RGB features, and RGB-D features is shown in Table 1. Results indicate that our method achieves higher performance in most of the cases given low-dimensional features, and its performance on RGB-D data is not sensitive to parameter $d$. When $d = 1$, the projected feature matrices may lose a certain amount of information. However, the structural information is preserved in BHIM, resulting in significantly higher performance than linear SVM. In addition, the learned shared space in BHIM is optimized for classification, which is not the case in PCA. Compared with bilinear SVM, noisy information is reduced in BHIM, and thus it achieves superior performance.

Figure 4. Visualizations of (a) the projected visual features $X^{[v]} W_f^{[v]}$, (b) the projected depth features $X^{[z]} W_f^{[z]}$, (c) the learned cross-modal features $O$ in the shared space, and the parameter matrices (d) $W_f^{[v]}$, (e) $W_f^{[z]}$, and (f) $W_w$.

Table 3. Knowledge transfer results on MSR Action Pairs dataset. X → Y denotes that X is the training data and Y is the testing data. $d = 31$ for both bilinear SVM and BHIM, and the dimensionality of features in linear SVM is $n_{xyt} \cdot d$. The number of elements in the input feature vector/matrix to the three methods is the same.

Methods          RGB-D→RGB    RGB-D→Depth    RGB→RGB-D    Depth→RGB-D
linear SVM       83.33%       81.67%         87.22%       86.11%
Bilinear SVM     90.56%       93.89%         81.67%       91.67%
Our method       97.78%       92.78%         97.78%       93.33%

When d = 31, even though linear SVM captures the full information
from visual and depth features, it does not capture
spatiotemporal relationships because of its vector-form feature
representation. In addition, depth and RGB features are
concatenated in linear SVM, so the two types of features are
compared directly. This may not be appropriate since they come
from different distributions. In contrast, BHIM addresses both
problems by using a matrix-form feature representation and
learning a shared feature space. The matrix-form representation
naturally captures spatiotemporal body part correlations.
Learning a shared feature space allows the two types of features
to be used effectively for recognition.

BHIM achieves lower results than bilinear SVM on depth-only and
RGB-only data when d = 5. This is because the learned cross-modal
features in BHIM lose too much discriminative information under
the information measure $\phi(\cdot)$ used in learning the
projection matrices. However, when both modalities are used, BHIM
outperforms bilinear SVM since the discriminative information
missing in one modality can be complemented by the other
available modality.


RGB-D action recognition results of BHIM with different values of
parameter C are shown in Table 4. Results indicate that BHIM is
insensitive to parameter C when its value is no greater than 1.
However, the performance drops sharply, to 64.44%, when C = 5.


Table 4. RGB-D action recognition results of our BHIM on the MSR
Action Pairs dataset with different values of parameter C.

C value    C = 0.01   C = 0.1   C = 1    C = 5
Accuracy   97.22%     98.33%    97.22%   64.44%

We also evaluate the performance of BHIM given different values
of $\lambda$. The value of $\lambda$ is set to 0.01, 0.1, 1, and
10. Results in Table 5 indicate that BHIM is relatively
insensitive to parameter $\lambda$. The largest performance
difference is about 5.5 percentage points, between $\lambda$ = 0.1
and $\lambda$ = 1. This insensitivity reduces the time needed for
parameter tuning.


Table 5. RGB-D action recognition results of our BHIM on the MSR
Action Pairs dataset with different values of parameter $\lambda$.

$\lambda$ value   $\lambda$ = 0.01   $\lambda$ = 0.1   $\lambda$ = 1   $\lambda$ = 10
Accuracy          97.22%             98.33%            92.78%          95.00%
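
The sensitivity claims for Tables 4 and 5 can be checked with a few lines of arithmetic. The sketch below simply copies the reported accuracies and computes the gap between the best and worst setting; the helper name `spread` and the dictionary names are illustrative, not part of the original work.

```python
# Accuracies (%) copied from Tables 4 and 5 (MSR Action Pairs, RGB-D input).
table4 = {0.01: 97.22, 0.1: 98.33, 1: 97.22, 5: 64.44}   # parameter C
table5 = {0.01: 97.22, 0.1: 98.33, 1: 92.78, 10: 95.00}  # parameter of Table 5


def spread(acc_by_value):
    """Return (best setting, worst setting, gap in percentage points)."""
    best = max(acc_by_value, key=acc_by_value.get)
    worst = min(acc_by_value, key=acc_by_value.get)
    gap = round(acc_by_value[best] - acc_by_value[worst], 2)
    return best, worst, gap


c_spread = spread(table4)
# Restrict Table 4 to the settings where C does not exceed 1.
c_spread_small = spread({c: a for c, a in table4.items() if c <= 1})
t5_spread = spread(table5)

print("Table 4, all C:   ", c_spread)
print("Table 4, C <= 1:  ", c_spread_small)
print("Table 5, all:     ", t5_spread)
```

The gap for Table 4 is small once C = 5 is excluded, while the gap for Table 5 stays within a few percentage points over the whole tested range.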

Knowledge Transfer. We evaluate the performance of our BHIM, and
investigate the effectiveness of the cross-modal features and the
information measure when one modality is missing in training or
testing. BHIM is tested in four scenarios: depth data are missing
in testing (RGB-D → RGB), RGB data are missing in testing
(RGB-D → Depth), depth data are missing in training
(RGB → RGB-D), and RGB data are missing in training
(Depth → RGB-D). We compare BHIM with linear SVM and bilinear
SVM, and investigate how the knowledge transferred from the
observed modality influences the performance of the three methods.

Recognition results in Table 3 show that BHIM outperforms linear
SVM in all four knowledge transfer scenarios, by 7.22 to 14.45
percentage points. This demonstrates the benefit of using a
matrix-form feature representation and the learned cross-modal
features in BHIM. Compared with bilinear SVM, BHIM also achieves
superior results in three of the four scenarios; the exception is
RGB-D → Depth, where bilinear SVM is 1.11 points higher. Thanks
to the information measure in learning the projection matrices,
BHIM is capable of reducing noise in learning the shared feature
space, which helps it outperform bilinear SVM in the remaining
cases.
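
The per-scenario comparison in Table 3 can be tabulated with a short script. This is only a reading aid built from the reported accuracies; the names `table3`, `margins`, and `wins` are illustrative.

```python
# Accuracies (%) copied from Table 3 (train modality -> test modality).
table3 = {
    "RGB-D -> RGB":   {"linear SVM": 83.33, "bilinear SVM": 90.56, "BHIM": 97.78},
    "RGB-D -> Depth": {"linear SVM": 81.67, "bilinear SVM": 93.89, "BHIM": 92.78},
    "RGB -> RGB-D":   {"linear SVM": 87.22, "bilinear SVM": 81.67, "BHIM": 97.78},
    "Depth -> RGB-D": {"linear SVM": 86.11, "bilinear SVM": 91.67, "BHIM": 93.33},
}


def margins(results, ours="BHIM"):
    """Margin of `ours` over each baseline, in percentage points."""
    return {
        scenario: {m: round(acc[ours] - a, 2)
                   for m, a in acc.items() if m != ours}
        for scenario, acc in results.items()
    }


# Scenarios in which BHIM has the highest accuracy of the three methods.
wins = [s for s, acc in table3.items() if max(acc, key=acc.get) == "BHIM"]

for scenario, diff in margins(table3).items():
    print(scenario, diff)
print("BHIM is best in", len(wins), "of", len(table3), "scenarios")
```

A negative margin marks a scenario in which the baseline is ahead, which makes it easy to see where the "most cases" qualification applies.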

4.3.  MSR  Daily  Activity  Dataset 

RGB and depth sequences in this dataset are spatially and
temporally normalized, and the people of interest are extracted
from these sequences. We follow the same training protocol as in
[21]. BHIM is first compared with existing approaches
[26, 11, 27, 13, 21, 25] on this dataset, and then evaluated
given RGB, depth, and RGB-D data, respectively. Linear SVM and
bilinear SVM are used as baselines.

Table 6. Recognition accuracy of comparison methods on the MSR
Daily Activity dataset.

Methods                    Accuracy
Linear SVM                 65.00%
Bilinear SVM               85.63%
Depth Motion Maps [26]     43.13%
RGGP [11]                  72.10%
Moving Pose [27]           73.80%
Local HON4D [13]           80.00%
Actionlet Ensemble [21]    85.75%
SNV [25]                   86.25%
Our method                 86.88%

BHIM is compared with existing approaches
[26, 11, 27, 13, 21, 25], and results are shown in Table 6. BHIM
achieves the highest accuracy (86.88%) among the compared
methods. BHIM significantly outperforms linear SVM, possibly due
to the learning of a shared feature space for the two types of
features, and a matrix-form representation that naturally
captures spatiotemporal structural information. The recognition
accuracy of BHIM is also higher than that of bilinear SVM due to
the use of the information measure, which is helpful in removing
redundant information and noise.


BHIM outperforms recent surface normal-based approaches [13, 25].
Although these approaches essentially capture structural
information in the feature design stage, they only focus on depth
sequences, and do not utilize valuable visual information. In
addition, the two approaches use the full-length feature vectors
and do not learn a better feature space for classification. BHIM
achieves better performance than the actionlet ensemble approach
[21] since it uses both visual and depth information, and
effectively compresses informative cues and removes noise before
classification.

Performance of the proposed BHIM on the RGB-only, depth-only, and
RGB-D data in the MSR Daily Activity dataset is also reported.
Linear SVM and bilinear SVM are adopted as baselines. Recognition
accuracy in Table 7 shows that BHIM achieves satisfactory results
even when only one modality of features is given. When only depth
features are given, linear SVM simply uses the features in the
original feature space for classification. By contrast, BHIM
finds a better feature space that removes noise in order to
achieve better performance. Compared with bilinear SVM, BHIM also
utilizes the information measure to compress data and reduce
redundancy in the data, which facilitates the recognition task.


Table 7. Comparison results on the MSR Daily Activity dataset
given depth-only, RGB-only, and RGB-D data.

Methods        Depth    RGB      Depth+RGB
Linear SVM     61.88%   54.38%   65.00%
Bilinear SVM   72.50%   67.50%   81.88%
Our method     81.88%   77.50%   86.88%

5.  Conclusion 

We have proposed a bilinear heterogeneous information machine
(BHIM) for action recognition from RGB-D sequences. Both RGB and
depth data are effectively utilized, and used to learn
cross-modal features for recognition. We represent both visual
and depth features in a matrix form to capture spatiotemporal
relationships. A novel low-rank bilinear classifier is proposed
to naturally model these feature matrices. BHIM learns a shared
space for fusing RGB and depth data, and produces the cross-modal
features. A large amount of noise is reduced in BHIM using the
information measure. Classification is performed in the shared
space using the learned cross-modal features. We learn a low-rank
BHIM by directly minimizing the rank of the model, in order to
increase the generalization power. An efficient optimization
algorithm is proposed in this work with an off-the-shelf SVM
solver as the inner optimization solver. BHIM is extensively
evaluated on two public RGB-D action datasets, and outperforms
state-of-the-art approaches.




Acknowledgement 

This research is supported in part by the NSF CNS award 1314484,
ONR award N00014-12-1-1028, ONR Young Investigator Award
N00014-14-1-0484, and U.S. Army Research Office Young
Investigator Award W911NF-14-1-0218.

References 

[1] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task
feature learning. IJCV, 2008.

[2] L. Bo, K. Lai, X. Ren, and D. Fox. Object recognition with
hierarchical kernel descriptors. In CVPR, June 2011.

[3] L. Chen, W. Li, and D. Xu. Recognizing RGB images by learning
from RGB-D data. In CVPR, 2014.

[4] T.-M.-T. Do and T. Artieres. Large margin training for hidden
Markov models with partially observed states. In ICML, 2009.

[5] S. Hadfield and R. Bowden. Hollywood 3D: Recognizing actions
in 3D natural scenes. In CVPR, 2013.

[6] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural
networks for human action recognition. PAMI, 2013.

[7] C. Jia, Y. Kong, Z. Ding, and Y. Fu. Latent tensor transfer
learning for RGB-D action recognition. In ACM Multimedia, 2014.

[8] T. Kobayashi. Low-rank bilinear classification: Efficient
convex optimization and extensions. IJCV, 2014.

[9] Y. Kong, Y. Jia, and Y. Fu. Interactive phrases: Semantic
descriptions for human interaction recognition. PAMI, 2014.

[10] W. Li, Z. Zhang, and Z. Liu. Action recognition based on a
bag of 3D points. In CVPR workshop, 2010.

[11] L. Liu and L. Shao. Learning discriminative representations
from RGB-D video data. In IJCAI, 2013.

[12] J. Luo, W. Wang, and H. Qi. Group sparsity and geometry
constrained dictionary learning for action recognition from depth
maps. In ICCV, 2013.

[13] O. Oreifej and Z. Liu. HON4D: Histogram of oriented 4D
normals for activity recognition from depth sequences. In CVPR,
2013.


[14] H. Pirsiavash, D. Ramanan, and C. Fowlkes. Bilinear
classifiers for visual recognition. In NIPS, 2009.

[15] M. Raptis and L. Sigal. Poselet key-framing: A model for
human activity recognition. In CVPR, 2013.

[16] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook,
M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman, and
A. Blake. Efficient human pose estimation from single depth
images. PAMI, 2013.

[17] K. Tang, L. Fei-Fei, and D. Koller. Learning latent temporal
structure for complex event detection. In CVPR, 2012.

[18] J. B. Tenenbaum and W. T. Freeman. Separating style and
content with bilinear models. Neural Computation, 2000.

[19] C. H. Teo, Q. Le, A. Smola, and S. Vishwanathan. A scalable
modular convex solver for regularized risk minimization. In KDD,
2007.

[20] N. Tishby, F. C. Pereira, and W. Bialek. The information
bottleneck method. In Proc. of the 37-th Annual Allerton
Conference on Communication, Control and Computing, pages
368-377, 1999.

[21] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining actionlet
ensemble for action recognition with depth cameras. In CVPR,
June 2012.

[22] L. Wolf, H. Jhuang, and T. Hazan. Modeling appearances with
low-rank SVM. In CVPR, 2007.

[23] L. Xia and J. Aggarwal. Spatio-temporal depth cuboid
similarity feature for activity recognition using depth camera.
In CVPR, 2013.

[24] C. Xu, D. Tao, and C. Xu. Large-margin multi-view
information bottleneck. PAMI, 36(8), 2014.

[25] X. Yang and Y. Tian. Super normal vector for activity
recognition using depth sequences. In CVPR, 2014.

[26] X. Yang, C. Zhang, and Y. Tian. Recognizing actions using
depth motion maps-based histograms of oriented gradients. In ACM
Multimedia, 2012.

[27] M. Zanfir, M. Leordeanu, and C. Sminchisescu. The moving
pose: An efficient 3D kinematics descriptor for low-latency
action recognition and detection. In ICCV, 2013.




